UCL Discovery

FigShare

Mining the Gene Wiki for functional genomic knowledge

Author: A Subramanian
AI Su
Andrew I Su
AR Aronson
AR Pico
B Mons
Benjamin M Good
C Jonquet
D Weekes
Douglas G Howe
DW Huang
E Callaway
E Camon
EB Camon
ES Lander
H Stehr
I Rivals
J Osborne
JC Venter
JW Huss
L Hirschman
LA Flórez
M Ashburner
M Waldrop
N Daraselia
NH Shah
R Hoffmann
R Tirrell
R Winnenburg
Simon M Lin
W Baumgartner
Warren A Kibbe
Z Lu
Publication venue: BioMed Central
Publication date: 01/12/2011

Abstract. Background: Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki is comprised of more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology. Results: Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses. Conclusions: The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses.

Evaluation of a large-scale biomedical data annotation initiative

Author: Christian Hinske
DA Lindberg
E Pitzer
EB Camon
Erik Pitzer
H Parkinson
HW Lee
J Dudley
JW Fan
K Ikeo
KW Fung
L Pevner
Lucila Ohno-Machado
NH Shah
Pedro Galante
Ronilda Lacson
S de Coronado
T Barrett
WJ Wilbur
Publication venue: BioMed Central
Publication date: 01/01/2009

Estimating the annotation error rate of curated GO database sequence annotations

Author: A Bairoch
A Vinayagam
Alfred L Brown
CE Jones
CH Wu
Craig E Jones
D Devos
D Groth
DM Martin
E Camon
EB Camon
H Xie
II Artamonova
M Linial
ML Green
MY Galperin
S Khan
SE Brenner
SF Altschul
Ute Baumann
WR Gilks
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2007

Background: Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences. Results: We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006) at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%. Conclusion: While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information. Authors: Craig E. Jones, Alfred L. Brown and Ute Baumann

Adelaide Research & Scholarship

Metrics for GO based protein semantic similarity: a systematic evaluation

Author: A Schlicker
A Valencia
André O Falcão
António EN Ferreira
C Pesquita
C Wu
Catia Pesquita
D Devos
D Faria
D Lin
Daniel Faria
E Camon
EB Camon
F Azuaje
F Couto
FM Couto
Francisco M Couto
Gentleman
Hugo Bastos
J Chabalier
J Jiang
J Tuikkala
JL Sevilla
L Stein
P Lord
P Resnik
PH Lee
RM Othman
RM Riensche
S Cao
T Joshi
X Guo
X Wu
Y Tao
Z Lei
ZH Duan
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which is the best approach to semantic similarity in this context, since there is no conclusive evaluation of the various measures. Another issue is whether electronic annotations should be used in semantic similarity calculations. Results: We conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation. Conclusions: This work has provided a basis for the comparison of several semantic similarity measures, and can aid researchers in choosing the most adequate measure for their work. We have found that the hybrid simGIC was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We have also found that the average and maximum combination approaches are problematic since both are inherently influenced by the number of terms being combined. We suspect that there may be a direct influence of data circularity in the behaviour of the results including electronic annotations, as a result of functional inference from sequence similarity.
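The simGIC measure named in this abstract is commonly described as a Jaccard index weighted by information content (IC): the summed IC of the GO terms two gene products share, divided by the summed IC of all terms that annotate either one, with each annotation set first extended by its ancestor terms. The following is a minimal sketch of that idea, not the authors' implementation. The toy ontology, the term names, and the annotation frequencies are all hypothetical and chosen only for illustration.

```python
from math import log2

# Hypothetical toy ontology (child -> list of parents); not real GO terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "kinase": ["catalysis"],
}

# Hypothetical annotation frequencies p(t) used to derive information content.
FREQ = {
    "root": 1.0,
    "binding": 0.5,
    "catalysis": 0.5,
    "dna_binding": 0.25,
    "kinase": 0.125,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def closure(terms):
    """Extend a set of annotated terms with every ancestor term."""
    extended = set()
    for term in terms:
        extended |= ancestors(term)
    return extended


def ic(term):
    """Information content: rarer terms carry more information."""
    return -log2(FREQ[term])


def simgic(terms_a, terms_b):
    """IC-weighted Jaccard index over the ancestor-extended term sets."""
    closed_a, closed_b = closure(terms_a), closure(terms_b)
    union_ic = sum(ic(t) for t in closed_a | closed_b)
    if union_ic == 0:
        # Only zero-IC terms (for example the root) are present;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    shared_ic = sum(ic(t) for t in closed_a & closed_b)
    return shared_ic / union_ic


score = simgic({"dna_binding"}, {"binding", "kinase"})
print(round(score, 4))
```

Because the score is a ratio over the whole extended term sets, it does not depend on averaging or maximising over term pairs, which is one way to read the abstract's remark that the average and maximum combination approaches are sensitive to the number of terms combined.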

Universidade de Lisboa: Repositório.UL

A transversal approach to predict gene product networks from ontology-based similarity

Author: A Budanitsky
A Schlicker
A Singhal
Anita Burgun
C Wolting
D Lin
DS Harris
E Agirre
E Camon
E Levy
EB Camon
F Azuaje
FD Gibbons
FJ Field
G Rigau
G Salton
GO Consortium
H Bedrine-Ferran
H Sun
H Wang
IG Wool
J Chabalier
J Jiang
Jean Mosser
JH Chiang
JM Mariadason
Julie Chabalier
M Gerstein
M Kanehisa
MB Eisen
MD Weiss
ME Brosnan
O Bodenreider
P Joseph
P Khatri
P Resnik
PW Lord
R Baeza-Yates
R Rada
RC Gentleman
T Barrett
T Nakajima
T Yamamoto
TK Jenssen
X Mao
Y Quentin
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: Interpretation of transcriptomic data is usually made through a "standard" approach which consists in clustering the genes according to their expression patterns and exploiting Gene Ontology (GO) annotations within each expression cluster. This approach makes it difficult to underline functional relationships between gene products that belong to different expression clusters. To address this issue, we propose a transversal analysis that aims to predict functional networks based on a combination of GO processes and expression data. Results: The transversal approach presented in this paper consists in computing the semantic similarity between gene products in a Vector Space Model. Through a weighting scheme over the annotations, we take into account the representativity of the terms that annotate a gene product. Comparing annotation vectors results in a matrix of gene product similarities. Combined with expression data, the matrix is displayed as a set of functional gene networks. The transversal approach was applied to 186 genes related to the enterocyte differentiation stages. This approach resulted in 18 functional networks that proved to be biologically relevant. These results were compared with those obtained through a standard approach and with an approach based on information content similarity. Conclusion: Complementary to the standard approach, the transversal approach offers new insight into the cellular mechanisms and reveals new research hypotheses by combining gene product networks based on semantic similarity and expression data.
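The vector-space comparison described in this abstract can be illustrated with a small sketch. Each gene product becomes a vector over annotation terms, the terms are weighted, and pairs of vectors are compared to fill a similarity matrix. The abstract does not specify the weighting scheme or the comparison function, so the inverse-frequency weight and the cosine similarity used below are assumptions, and the gene names and annotations are hypothetical.

```python
from math import log, sqrt

# Hypothetical gene -> GO process annotations, for illustration only.
ANNOTATIONS = {
    "geneA": {"lipid transport", "cell differentiation"},
    "geneB": {"lipid transport", "cell cycle"},
    "geneC": {"cell cycle"},
}


def term_weights(annotations):
    """Assumed weighting: terms that annotate fewer genes weigh more."""
    n_genes = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    # The +1.0 keeps the weight positive even for a term on every gene.
    return {term: log(n_genes / c) + 1.0 for term, c in counts.items()}


def to_vector(terms, weights):
    """Sparse annotation vector: term -> weight."""
    return {term: weights[term] for term in terms}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(annotations):
    """Nested dict holding the pairwise gene product similarities."""
    weights = term_weights(annotations)
    vectors = {g: to_vector(t, weights) for g, t in annotations.items()}
    return {a: {b: cosine(vectors[a], vectors[b]) for b in vectors}
            for a in vectors}


sim = similarity_matrix(ANNOTATIONS)
for gene, row in sim.items():
    print(gene, {other: round(value, 3) for other, value in row.items()})
```

In a workflow like the one the abstract outlines, such a matrix would then be thresholded and combined with expression data to draw networks. That step is not shown here because the abstract gives no detail on it.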

The evolutionary signal in metagenome phyletic profiles predicts many gene functions

Background. The function of many genes is still not known even in model organisms. An increasing availability of microbiome DNA sequencing data provides an opportunity to infer gene function in a systematic manner. Results. We evaluated if the evolutionary signal contained in metagenome phyletic profiles (MPP) is predictive of a broad array of gene functions. The MPPs are an encoding of environmental DNA sequencing data that consists of relative abundances of gene families across metagenomes. We find that such MPPs can accurately predict 826 Gene Ontology functional categories, while drawing on human gut microbiomes, ocean metagenomes, and DNA sequences from various other engineered and natural environments. Overall, in this task, the MPPs are highly accurate, and moreover they provide coverage for a set of Gene Ontology terms largely complementary to standard phylogenetic profiles, derived from fully sequenced genomes. We also find that metagenomes approximated from taxon relative abundance obtained via 16S rRNA gene sequencing may provide surprisingly useful predictive models. Crucially, the MPPs derived from different types of environments can infer distinct, non-overlapping sets of gene functions and therefore complement each other. Consistently, simulations on > 5000 metagenomes indicate that the amount of data is not in itself critical for maximizing predictive accuracy, while the diversity of sampled environments appears to be the critical factor for obtaining robust models. Conclusions. In past work, metagenomics has provided invaluable insight into ecology of various habitats, into diversity of microbial life and also into human health and disease mechanisms. We propose that environmental DNA sequencing additionally constitutes a useful tool to predict biological roles of genes, yielding inferences out of reach for existing comparative genomics approaches

ZENODO

Full-text Institutional Repository of the Ruđer Bošković Institute

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Benchmarking natural-language parsers for biological applications using dependency graphs

Author: A Bies
AB Clegg
Adrian J Shepherd
Andrew B Clegg
B Rosario
B Srinivas
C Friedman
C Grover
D Blaheta
D Gildea
D Klein
D Lin
D Sleator
DM Bikel
E Charniak
E Tsivtsivadze
EB Camon
EJ Briscoe
G Sampson
G Schneider
IM Goldin
J Carroll
J Finkel
J Xiao
JM Temkin
K Franzén
K Knight
KB Cohen
L Smith
M Collins
M Lease
MC de Marneffe
MP Marcus
N Domedel-Puig
N Ge
O Sanchez
P Merlo
PG Mutalik
S Abney
S Kübler
S Pyysalo
ST Ahmed
T Briscoe
TC Rindflesch
Y Huang
Z Shi
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria. RESULTS: Using the GENIA corpus as a gold standard, we tested four open-source parsers which have been used in bioinformatics projects. We first present overall performance measures, and test the two leading tools, the Charniak-Lease and Bikel parsers, on subtasks tailored to reflect the requirements of a system for extracting gene expression relationships. These two tools clearly outperform the other parsers in the evaluation, and achieve accuracy levels comparable to or exceeding native dependency parsers on similar tasks in previous biological evaluations. CONCLUSION: Evaluating using dependency graphs allows parsers to be tested easily on criteria chosen according to the semantics of particular biological applications, drawing attention to important mistakes and soaking up many insignificant differences that would otherwise be reported as errors. Generating high-accuracy dependency graphs from the output of phrase-structure parsers also provides access to the more detailed syntax trees that are used in several natural-language processing techniques
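One common way to score a parser against a gold standard in dependency form is to treat each parse as a set of (head, relation, dependent) triples and compute precision, recall, and F-score over the sets. The sketch below follows that general idea and adds an optional relation filter to mimic tailoring the evaluation to an application. It is an illustration under assumptions, not the evaluation code used in the paper. The example sentence, the relation labels, and the function name are hypothetical.

```python
def score_dependencies(gold, predicted, relations=None):
    """Precision, recall and F-score over dependency triples.

    gold, predicted: iterables of (head, relation, dependent) tuples.
    relations: optional set of relation labels to restrict scoring to,
    for example only the relations an extraction task relies on.
    """
    gold_set, pred_set = set(gold), set(predicted)
    if relations is not None:
        gold_set = {t for t in gold_set if t[1] in relations}
        pred_set = {t for t in pred_set if t[1] in relations}

    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical gold and predicted parses of "IL-2 strongly activates NF-kB".
gold = [
    ("activates", "nsubj", "IL-2"),
    ("activates", "dobj", "NF-kB"),
    ("activates", "advmod", "strongly"),
]
predicted = [
    ("activates", "nsubj", "IL-2"),
    ("activates", "dobj", "NF-kB"),
    ("NF-kB", "advmod", "strongly"),  # adverb attached to the wrong head
]

print(score_dependencies(gold, predicted))
print(score_dependencies(gold, predicted, relations={"nsubj", "dobj"}))
```

Restricting the score to subject and object relations ignores the misattached adverb, which shows how a task-specific criterion can absorb differences that do not matter for a given extraction task.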

Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD)

Author: A Jimeno
Allan Peter Davis
AP Davis
AR Aronson
B Settles
Carolyn J Mattingly
CJ Mattingly
D Hanisch
D Rebholz-Schuhmann
EB Camon
EM Voorhees
H Chen
HM Muller
J Lin
K Bretonnel Cohen
L Lopez-Maury
Lynette Hirschman
M Krallinger
O Gospodnetic
P Corbett
R Hoffmann
R Leaman
R Winnenburg
RB Altman
Thomas C Wiegers
WA Toscano
X Yuan
Y Garten
Publication venue: 'Springer Science and Business Media LLC'
Publication date

Keele Research Repository

Comparative Genomics of the Apicomplexan Parasites Toxoplasma gondii and Neospora caninum: Coccidia Differing in Host Range and Transmission Strategy

Author: A Bahl
A Keeley
A Khaminets
A Khan
A Michelin
A Naguleswaran
AD Stewart
Adam James Reid
AJ Trees
AL Delcher
Alexander J. Trees
AM Cohen
AM Pollard
Amandeep Sohal
AP Sinai
AR Jones
Arnab Pain
B Gajria
BA Butcher
BJ Haas
Boris Striepen
Brian Brunk
C Hertz-Fowler
C Jung
C Su
CM McCann
CS Sohn
David Harris
David S. Roos
DE Neafsey
Dhanasekaran Shanmugam
DK Howe
DL Alexander
DL Swofford
E Eizirik
EB Camon
EK De Silva
GL Mandell
Grant A. Hill-Cawthorne
H El Hajj
HC Davison
HJ Painter
I Korf
I Kozarewa
J Ellis
J Felsenstein
J Quackenbush
J Wasmuth
James A. Cotton
James D. Wasmuth
JC Boothroyd
JD DeBarry
JD Dunn
JE Allen
JH Morisaki
JL Jones
JM Dobrowolski
John Parkinson
Jonathan C. Howard
Jonathan M. Wastling
JP Dubey
JP Hunn
JP Saeij
JS Barber
L Li
M Hasegawa
M Kanehisa
M Lebrun
M Pertea
M Stanke
M Yamamoto
M Yuda
Mandy Sanders
Matthew Berriman
ME Grigg
Michael A. Quail
Michael E. Grigg
MK Shaw
ML Reese
MM McAllister
MS Behnke
N Friedrich
N Papic
PJ Bradley
PW Ewald
RC Edgar
RE Ricklefs
Rebecca Norton
S Anders
S Batzoglou
S Hunter
S Martens
S Tavare
S Taylor
Sarah J. Vermont
SB Hedges
SF Altschul
SJ Fentress
SK Kim
Sophia M. Latham
Stephanie Könen-Waisman
T Carver
T Dowse
T Steinfeldt
Tobias Mourier
VB Carruthers
WH Majoros
YC Ong
Z Ning
Z Yang
Publication venue: Public Library of Science
Publication date: 01/01/2012

Toxoplasma gondii is a zoonotic protozoan parasite which infects nearly one third of the human population and is found in an extraordinary range of vertebrate hosts. Its epidemiology depends heavily on horizontal transmission, especially between rodents and its definitive host, the cat. Neospora caninum is a recently discovered close relative of Toxoplasma, whose definitive host is the dog. Both species are tissue-dwelling Coccidia and members of the phylum Apicomplexa; they share many common features, but Neospora neither infects humans nor shares the same wide host range as Toxoplasma, rather it shows a striking preference for highly efficient vertical transmission in cattle. These species therefore provide a remarkable opportunity to investigate mechanisms of host restriction, transmission strategies, virulence and zoonotic potential. We sequenced the genome of N. caninum and transcriptomes of the invasive stage of both species, undertaking an extensive comparative genomics and transcriptomics analysis. We estimate that these organisms diverged from their common ancestor around 28 million years ago and find that both genomes and gene expression are remarkably conserved. However, in N. caninum we identified an unexpected expansion of surface antigen gene families and the divergence of secreted virulence factors, including rhoptry kinases. Specifically we show that the rhoptry kinase ROP18 is pseudogenised in N. caninum and that, as a possible consequence, Neospora is unable to phosphorylate host immunity-related GTPases, as Toxoplasma does. This defense strategy is thought to be key to virulence in Toxoplasma. We conclude that the ecological niches occupied by these species are influenced by a relatively small number of gene products which operate at the host-parasite interface and that the dominance of vertical transmission in N. caninum may be associated with the evolution of reduced virulence in this species

Public Library of Science (PLOS)